1 Background

This report provides the analysis of RNA-Seq data from iron-depletion experiments in Giardia WB. 2 conditions, 1 control and three replicates for each. Analysis was done with edgeR.

2 Data preparation

Giardia WB reference genome and annotation were from GiardiaDB (v. 5.0). Annotation in GFF file was converted to GTF to mapping with STAR v. 020201, htseq-count was used to extract the raw counts from the mapped .BAM files.

2.1 Specify some pararmeters, directory and files

2.2 Data summary

TYDK is the control, NOFERRIC is TYDK without ferric (less iron), 50uM is TYDK without ferric and supplemented with 50 µM iron inhibitor (even less iron).

2.3 Buid DGE

Read and merge files containing gene expression counts.

2.4 Gene annotation

Read the pre-compiled annotation_tab, and add the annotation to the DGEList. There is no Name for the Giardia annotation.

2.5 Data filtering

Only genes with reasonable expression level should be included in the analysis. Keep genes with at least 2 count per million in at least 3 samples (replicates).

## keep
## FALSE  TRUE 
##  4333  5419

2.6 Library sizes

Plot number of reads matched per sample, and total for each gene across all samples. The high peak in the counts per genes are the ribosomal RNAs.

2.7 Normalization

The raw library sizes were normalized with a scaling factor for each sample that aims to minimize the fold changes between the samples for the majority of the genes. Ideally all scaling factors should be around 1. TMM the default method was used.

We can see the lib.size are quite equivalent, and normalization factors are alsp around 1.

Below we can compare the different libraries before and after normalizations. Relative Log Expression (RLE) plots are powerful tool for visalizaing unwanted variation in high dimensional data. Ideally all boxplots should be centered around 0 and be of similar width. Compare the boxplots before and after normaliztion, we can see the normalization worked well for this data.

3 Data exploration

3.1 Sample clustering

The MDS plot shows the unsupervised clustering of the samples. It’s a two-dimensional scatterplot that distances on the plot approximate the typical log2 fold changes between the samples based the top 500 most variable genes between each pair of samples. On this plot, we can see that samples from the same hours do cluster together. But batch effect can not be excluded, which will be discussed later.

3.2 Biological variation

The square root of the common dispersion gives the coefficient of variation of biological variation (BCV). The BCV indicates how the abundance of each gene varies between replicate samples. The common BCV for the data set:

## [1] 0.007432507

The number is low and good. A plot of the averge \(\log_2\)CPM versus the BCV for each gene is shown below. The plot visualises how gene variation between replicate samples (BCV) changes as gene abundance (\(\log_2\) counts per million) increase. We can see the trended dispersion shows a decreasing trend with expression level. Only the trended dispersion is used under the quasi-likelihood (QL) pipeline. ## Dispersion estimation The QL dispersions can be estimated using a genewise negative binomial generalized linear models with Quasi-likelihood test (glmQLFit), and then be visualized with the plotQLDisp function. The big sample size variation cause the QL dispersions are squeezed very heavily from the raw values. The data is transformed to \(\log_2\)CPM.

4 Differential expression analysis

QL F-test (glmQLFTest) were used to determine significant differential expression. The QL F-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. It also provides more robust and reliable error rate control when the number of replicates is small. ## Contrasts Comparisons of interest listed below. Contrast defines the null hypothesis as the comparison intersted to be equal to zero.

4.1 NoFerric

4.1.1 Log-fold change against log-counts per million

The test results are visualized in te following smear plot. Genes that are significantly DE with an FDR of 5% are highlighted in red and blue.

4.1.2 Heatmap of the moderated log-counts-per-million

Heatmap of the moderated log-counts-per-million of the DE genes in this comparison.

4.1.3 Volcano plot

A volcano plot the fold change on the x-axis and the statistial significance on the y-axis (the -log10 of the p-value. Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot. Genes with FDR<0.05 are colored red, and abs(logFC)>1 are colored orange, and green if both.

4.1.4 Export significant cases

The table lists all the DE with abs(logFC) > 1 and FDR< 0.05.

4.1.5 Export logCPM values of the DE

Inspect the depth-adjusted reads per million for the top differentially expressed.

4.2 IronInh

4.2.1 Log-fold change against log-counts per million

The test results are visualized in te following smear plot. Genes that are significantly DE with an FDR of 5% are highlighted in red and blue.

4.2.2 Heatmap of the moderated log-counts-per-million

Heatmap of the moderated log-counts-per-million of the DE genes in this comparison.

4.2.3 Volcano plot

A volcano plot the fold change on the x-axis and the statistial significance on the y-axis (the -log10 of the p-value. Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot. Genes with FDR<0.05 are colored red, and abs(logFC)>1 are colored orange, and green if both.

4.2.4 Export significant cases

The table lists all the DE with abs(logFC) > 1 and FDR< 0.05.

4.2.5 Export logCPM values of the DE

Inspect the depth-adjusted reads per million for the top differentially expressed.

4.3 IronInhvsNoFerric

4.3.1 Log-fold change against log-counts per million

The test results are visualized in te following smear plot. Genes that are significantly DE with an FDR of 5% are highlighted in red and blue.

4.3.2 Heatmap of the moderated log-counts-per-million

Heatmap of the moderated log-counts-per-million of the DE genes in this comparison.

4.3.3 Volcano plot

A volcano plot the fold change on the x-axis and the statistial significance on the y-axis (the -log10 of the p-value. Genes that are highly dysregulated are farther to the left and right sides, while highly significant changes appear higher on the plot. Genes with FDR<0.05 are colored red, and abs(logFC)>1 are colored orange, and green if both.

4.3.4 Export significant cases

The table lists all the DE with abs(logFC) > 1 and FDR< 0.05.

4.3.5 Export logCPM values of the DE

Inspect the depth-adjusted reads per million for the top differentially expressed.

4.4 Summary of DE genes

Genes that are DE are the ones with adjusted p-value (same as FDR in the table) less than a defined false discovery (FDR) rate. The adjusted p-value is used as opposed to the initial p-value as it has been adjusted for multiple testing. The FDR cutoff is set at 0.05. So genes with adjusted p-value < 0.05 is considered as DE, which were further classified into up- or down- regulated. Up-regulated genes has a positve log-fold change, while down-regulated genes has a negative log-fold change. DE with abs(logFC) > 1 was also show below.

4.5 Heatmap of the logFC

Heatmap of the logFC of the DE genes as well as heatmap of the top, 2000, genes.

4.6 Venn Diagram

Venn Diagram of the three most interested comparisons with multiple DE. Note the numbers do not agree between both venn diagrams. The second diagram shows only those genes that were either up in both samples or down in both samples. The first diagram also includes genes that were up in one sample and down in the other, which is a less restrictive criterion. And the total numbers in each contrast do add up the same.